Abstract

The task of inferring a set of classes and class
descriptions most likely to explain a given
data set can be placed on a firm theoretical
foundation using Bayesian statistics. Within
this framework, and using various mathematical
and algorithmic approximations, the AutoClass
system searches for the most probable
classifications, automatically choosing the
number of classes and complexity of class
descriptions. A simpler version of AutoClass has
been applied to many large real data sets, has
discovered new independently-verified phenomena,
and has been released as a robust software
package. Recent extensions allow attributes
to be selectively correlated within particular
classes, and allow classes to inherit, or
share, model parameters through a class hierarchy.
In this paper we summarize the mathematical
foundations of AutoClass.

1 Introduction 

The task of supervised classification - i.e., learning to pre- 
dict class memberships of test cases given labeled train- 
ing cases - is a familiar machine learning problem. A re- 
lated problem is unsupervised classification, where train- 
ing cases are also unlabeled. Here one tries to predict all 
features of new cases; the best classification is the least 
“surprised” by new cases. This type of classification, 
related to clustering, is often very useful in exploratory 
data analysis, where one has few preconceptions about 
what structures new data may hold. 

We have previously developed and reported on AutoClass
[Cheeseman et al., 1988a; Cheeseman et al.,
1988b], an unsupervised classification system based on
Bayesian theory. Rather than just partitioning cases, 
as most clustering techniques do, the Bayesian approach 
searches in a model space for the “best” class descrip- 
tions. A best classification optimally trades off predic- 
tive accuracy against the complexity of the classes, and 
so does not “overfit” the data. Such classes are also 
“fuzzy”; instead of each case being assigned to a class, a 
case has a probability of being a member of each of the 
different classes. 

AutoClass III, the most recent released version, combines
real and discrete data, allows some data to be missing,
and automatically chooses the number of classes
from first principles. Extensive testing has indicated 
that it generally produces significant and useful results, 
but is primarily limited by the simplicity of the mod- 
els it uses, rather than, for example, inadequate search 
heuristics. AutoClass III assumes that all attributes are 
relevant, that they are independent of each other within 
each class, and that classes are mutually exclusive. Recent
extensions, embodied in AutoClass IV, let us relax
two of these assumptions, allowing attributes to be
selectively correlated and to have more or less relevance
via a class hierarchy.

This paper summarizes the mathematical foundations 
of AutoClass, beginning with the Bayesian theory of 
learning, and then applying it to increasingly complex 
classification problems, from various single class mod- 
els up to hierarchical class mixtures. For each problem, 
we describe our assumptions in words and mathematics, 
and then give the resulting evaluation and estimation 
functions for comparing models and making predictions. 
The derivations of these results from these assumptions, 
however, are not given. 

2 Bayesian Learning 

Bayesian theory gives a mathematical calculus of degrees 
of belief, describing what it means for beliefs to be con- 
sistent and how they should change with evidence. This 
section briefly reviews that theory, describes an approach 
to making it tractable, and comments on the resulting 
tradeoffs. In general, a Bayesian agent uses a single real
number to describe its degree of belief in each proposition
of interest. This assumption, together with some other 
assumptions about how evidence should affect beliefs, 
leads to the standard probability axioms. This result 
was originally proved by Cox [Cox, 1946] and has been 
reformulated for an AI audience [Heckerman, 1990]. We 
now describe this theory. 

2.1 Theory 

Let E denote some evidence that is known or could po- 
tentially be known to an agent; let H denote a hypothe- 
sis specifying that the world is in some particular state; 
and let the sets of possible evidence E and possible states 
of the world H each be mutually exclusive and exhaustive
sets. For example, if we had a coin that might be
two-headed, the possible states of the world might be
“ordinary coin” and “two-headed coin”. If we were to toss
it once, the possible evidence would be “lands heads” and
“lands tails”.

In general, $P(ab|cd)$ denotes a real number describing
an agent’s degree of belief in the conjunction of propositions
$a$ and $b$, conditional on the assumption that propositions
$c$ and $d$ are true. The propositions on either side
of the conditioning bar “|” can be arbitrary Boolean expressions.
More specifically, $\pi(H)$ is a “prior” describing
the agent’s belief in $H$ before, or in the absence of, seeing
evidence $E$, $\pi(H|E)$ is a “posterior” describing the
agent’s belief after observing some particular evidence
$E$, and $L(E|H)$ is a “likelihood” embodying the agent’s
theory of how likely it would be to see each possible evidence
combination $E$ in each possible world $H$.

To be consistent, beliefs must be non-negative,
$0 \le P(a|b) \le 1$, and normalized, so that $\sum_H \pi(H) = 1$ and
$\sum_E L(E|H) = 1$. That is, the agent is sure that the
world is in some state and that some evidence will be
observed. The likelihood and the prior together give a
“joint” probability $J(EH) \equiv L(E|H)\pi(H)$ of both $E$
and $H$. Normalizing the joint gives Bayes’ rule, which
tells how beliefs should change with evidence:

$$\pi(H|E) = \frac{J(EH)}{\sum_H J(EH)} = \frac{L(E|H)\,\pi(H)}{\sum_H L(E|H)\,\pi(H)}.$$

When the set of possible $H$s is continuous, the prior
$\pi(H)$ becomes a differential $d\pi(H)$, and the sums over
$H$ are replaced by integrals. Similarly, continuous $E$s
have a differential likelihood $dL(E|H)$, though any real
evidence $\Delta E$ will have a finite probability
$\Delta L(E|H) \approx dL(E|H)\,\frac{\Delta E}{dE}$.

In theory, all an agent needs to do in any given situ- 
ation is to choose a set of states H, an associated like- 
lihood function describing what evidence is expected to 
be observed in those states, a set of prior expectations 
on the states, and then collect some relevant evidence. 
Bayes’ rule then specifies the appropriate posterior be- 
liefs about the state of the world, which can be used to 
answer most questions of interest. An agent can combine 
these posterior beliefs with its utility over states U(H), 
which says how much it prefers each possible state, to 
choose an action A which maximizes its expected utility 

$$EU(A) = \sum_H U(H)\,\pi(H|EA).$$
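
As an illustration of Bayes' rule on the coin example above, the following is a minimal Python sketch. The prior of 0.01 on the two-headed coin and the names `prior`, `likelihood`, and `posterior` are assumptions made for the example; they do not come from AutoClass.

```python
# Prior belief pi(H) in each state of the world (assumed values).
prior = {"ordinary": 0.99, "two-headed": 0.01}

# Likelihood L(E|H): probability of each outcome of one toss in each state.
likelihood = {
    "ordinary": {"heads": 0.5, "tails": 0.5},
    "two-headed": {"heads": 1.0, "tails": 0.0},
}


def posterior(prior, likelihood, evidence):
    """Bayes' rule: normalize the joint J(EH) = L(E|H) * pi(H) over H."""
    joint = {h: likelihood[h][evidence] * p for h, p in prior.items()}
    total = sum(joint.values())
    return {h: j / total for h, j in joint.items()}


# Update the belief after each of five tosses that all land heads.
belief = prior
for toss in ["heads"] * 5:
    belief = posterior(belief, likelihood, toss)

print(belief)
```

Each toss that lands heads shifts belief toward the two-headed coin, while a single tails would drive that belief to zero, since the two-headed hypothesis gives tails a likelihood of zero.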


2.2 Practice 

In practice this theory can be difficult to apply, as the 
sums and integrals involved are often mathematically in- 
tractable. So one must use approximations. Here is our 
approach. 

Rather than consider all possible states of the world, 
we focus on some smaller space of models, and do all
of our analysis conditional on an assumption S that the 
world really is described by one of the models in our 
space. As with most modeling, this assumption is almost 
certainly false, but it makes the analysis tractable. With 
time and effort we can make our models more complex, 
expanding our model space in order to reduce the effect 
of this simplification. 


The parameters which specify a particular model are 
split into two sets. First, a set of discrete parameters T 
describe the general form of the model, usually by spec- 
ifying some functional form for the likelihood function. 
For example, T might specify whether two variables are 
correlated or not, or how many classes are present in a 
classification. Second, free variables in this general form, 
such as the magnitude of the correlation or the relative 
sizes of the classes, constitute the remaining continuous 
model parameters V. 

We generally prefer a likelihood$^1$ $L(E|VTS)$ which is
mathematically simple and yet still embodies the kinds
of complexity we believe to be relevant.

Similarly, we prefer a simple prior distribution
$d\pi(VT|S)$ over this model space, allowing the resulting
$V$ integrals, described below, to be at least approximated.
A prior that predicts the different parameters
in $V$ independently, through a product of terms for each
different parameter, often helps. We also prefer the prior
to be as broad and uninformative as possible, so our software
can be used in many different problem contexts,
though in principle we could add specific domain knowledge
through an appropriate prior. Finally we prefer a
prior that gives nearly equal weight to different levels 
of model complexity, resulting in a “significance test”. 
Adding more parameters to a model then induces a cost, 
which must be paid for by a significantly better fit to the 
data before the more complex model is preferred. 

Sometimes the integrable priors are not broad enough, 
containing meta-parameters which specify some part of 
model space to focus on, even though we have no prior 
expectations about where to focus. In these cases we 
“cheat” and use simple statistics collected from the evidence
we are going to use, to help set these priors.$^2$ For
example, see Sections 4.2 and 4.5.

The joint can now be written as $dJ(EVT|S) =
L(E|VTS)\,d\pi(VT|S)$ and, for a reasonably-complex
problem, is usually a very rugged distribution in VT, 
with an immense number of sharp peaks distributed 
widely over a huge high-dimensional space. Because of 
this we despair of directly normalizing the joint, as re- 
quired by Bayes’ rule, or of communicating the detailed 
shape of the posterior distribution. 

Instead we break the continuous V space into regions 
R surrounding each sharp peak, and search until we tire 
for combinations $RT$ for which the “marginal” joint

$$M(ERT|S) \equiv \int_{V \in R} dJ(EVT|S)$$

is as large as possible. The best few such “models” RT 
are then reported, even though it is usually almost cer- 
tain that more probable models remain to be found. 

Each model $RT$ is reported by describing its marginal
joint $M(ERT|S)$, its discrete parameters $T$, and estimates
of typical values of $V$ in the region $R$, like the
mean estimate of $V$:

$$\mathcal{E}(V|ERTS) = \frac{\int_{V \in R} V\, dJ(EVT|S)}{M(ERT|S)}$$

or the $V$ for which $dJ(EVT|S)$ is maximum in $R$. While
these estimates are not invariant under reparameterizations
of the $V$ space, and hence depend on the syntax
with which the likelihood was expressed, the peak is usually
sharp enough that such differences don’t matter.

$^1$ Note that when a variable like $V$ sits in a probability
expression where a proposition should be, it stands for a
proposition that the variable has a particular value.

$^2$ This is cheating because the prior is supposed to be
independent of evidence.

Reporting only the best few models is usually justified, 
since the models weaker than this are usually many or- 
ders of magnitude less probable than the best one. The 
main reason for reporting models other than the best is 
to show the range of variation in the models, so that one 
can judge how different the better, not yet found, models 
might be. 

The decision to stop searching for better models $RT$
than the current best can often be made in a principled
way by using estimates of how much longer it would
take to find a better model, and how much better that
model would be.

If the fact that a data value is unknown
might be informative, one can model “unknown”
as just another possible (discrete) data value; otherwise
the likelihood for an unknown value is just a sum over
the possible known values.

To make predictions with these resulting models, a 
reasonable approximation is to average the answer from 
the best few peaks, weighted by the relative marginal 
joints. Almost all of the weight is usually in the best 
few, justifying the neglect of the rest. 
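
The weighted averaging described above can be sketched in Python as follows. This is only an illustration: the log marginal values and per-model predictions are made-up numbers, and the function names are hypothetical rather than part of AutoClass. Working with logarithms is an assumption of the sketch, made because marginal joints for real data sets are usually far too small to represent directly.

```python
import math


def model_weights(log_marginals):
    """Relative weights proportional to the marginal joints M(ERT|S)."""
    top = max(log_marginals)
    # Subtract the largest log marginal before exponentiating to avoid underflow.
    scaled = [math.exp(lm - top) for lm in log_marginals]
    total = sum(scaled)
    return [s / total for s in scaled]


def averaged_prediction(log_marginals, predictions):
    """Average the models' predicted probabilities, weighted by marginal joint."""
    weights = model_weights(log_marginals)
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical log marginal joints for the three best models found.
log_m = [-1050.0, -1052.3, -1061.0]
# Hypothetical probability each model assigns to some event of interest.
preds = [0.80, 0.60, 0.10]

weights = model_weights(log_m)
print(weights, averaged_prediction(log_m, preds))
```

With these numbers the third model is several orders of magnitude less probable than the first, so it contributes almost nothing to the average.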

2.3 Tradeoffs 

Bayesian theory offers the advantages of being theoret- 
ically well-founded and empirically well-tested [Berger, 
1985]. It offers a clear procedure whereby one can almost 
“turn the crank”, modulo doing integrals and search, to 
deal with any new problem. The machinery automati- 
cally trades off the complexity of a model against its fit 
to the evidence. Background knowledge can be included 
in the input, and the output is a flexible mixture of sev- 
eral different “answers,” with a clear and well-founded 
decision theory [Berger, 1985] to help one use that out- 
put. 

Disadvantages include being forced to be explicit 
about the space of models one is searching in, though 
this can be good discipline. One must deal with some 
difficult integrals and sums, although there is a huge lit- 
erature to help one here. And one must often search 
large spaces, though most any technique will have to do 
this and the joint probability provides a good local eval- 
uation function. Finally, it is not clear how one can take 
the computational cost of doing a Bayesian analysis into 
account without a crippling infinite regress. 

Some often perceived disadvantages of Bayesian anal- 
ysis are really not problems in practice. Any ambiguities 
in choosing a prior are generally not serious, since the 
various possible convenient priors usually do not disagree 
strongly within the regions of interest. Bayesian analysis 
is not limited to what is traditionally considered “statis- 
tical” data, but can be applied to any space of models 
about how the world might be. For a general discussion 
of these issues, see [Cheeseman, 1990]. 

We will now illustrate this general approach by apply- 
ing it to the problem of unsupervised classification. 


3 Model Spaces Overview 

3.1 Conceptual Overview 

In this paper we deal only with attribute-value, not relational,
data.$^3$ For example, medical cases might be
described by medical forms with a standard set of en- 
tries or slots. Each slot could be filled only by elements 
of some known set of simple values, like numbers, colors, 
or blood-types. (In this paper, we will only deal with 
real and discrete attributes.) 

We would like to explain this data as consisting of a 
number of classes, each of which corresponds to a dif- 
fering underlying cause for the symptoms described on 
the form. For example, different patients might fall into 
classes corresponding to the different diseases they suffer 
from. 

To do a Bayesian analysis of this, we need to make 
this vague notion more precise, choosing specific math- 
ematical formulas which say how likely any particular 
combination of evidence would be. A natural way to do 
this is to say that there are a certain number of classes, 
that a random patient has a certain probability to come 
from each of them, and that the patients are distributed 
independently - once we know all about the underlying 
classes then learning about one patient doesn’t help us 
learn what any other patient will be like. 

In addition, we need to describe how each class is dis- 
tributed. We need a “single class” model saying how 
likely any given evidence is, given that we know what 
class the patient comes from. Thus we build the multi- 
class model space from some other pre-existing model 
space, which can be arbitrarily complex. (In fact, much
of this paper will be spent describing various single class
models.) In general, the more complex each class can be,
the less of a need there is to invoke multiple classes to 
explain the variation in the data. 

The simplest way to build a single-class model is to 
predict each attribute independently, i.e., build it from 
attribute-specific models. A class has a distribution for 
each attribute and, if you know the class of a case, learn- 
ing the values of one attribute doesn’t help you predict 
the value of any other attributes. For real attributes one 
can use a standard normal distribution, characterized 
by some specific mean and a variance around that mean. 
For discrete attributes one can use the standard multino- 
mial distribution, characterized by a specific probability 
for each possible discrete value. 
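
A minimal Python sketch of such a single-class model with independent attributes is given below. The class description, attribute names, and parameter values are hypothetical, and for simplicity the sketch uses the normal density directly and ignores the finite measurement precision of real values discussed in Section 4.2.

```python
import math


def normal_log_density(x, mean, sigma):
    """Log of the normal density with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def class_log_likelihood(case, class_model):
    """Sum per-attribute log-likelihoods; independence makes the product factor."""
    total = 0.0
    for name, value in case.items():
        kind, params = class_model[name]
        if kind == "real":
            mean, sigma = params
            total += normal_log_density(value, mean, sigma)
        else:
            # Discrete attribute: params maps each possible value to its probability.
            total += math.log(params[value])
    return total


# Hypothetical class with one real and one discrete attribute.
flu_class = {
    "temperature": ("real", (38.5, 0.6)),
    "blood_type": ("discrete", {"A": 0.4, "B": 0.1, "AB": 0.05, "O": 0.45}),
}
patient = {"temperature": 38.9, "blood_type": "O"}

print(class_log_likelihood(patient, flu_class))
```

Because the attributes are independent within the class, knowing the patient's blood type changes nothing about the predicted temperature, which is why the log-likelihood is a plain sum.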

Up to this point we have described the model space of
AutoClass III. AutoClass IV goes beyond this by introducing
correlation and inheritance. Correlation is introduced
by removing the assumption that attributes are
independent within each class. The simplest way to do 
this is to let all real attributes covary, and let all discrete 
attributes covary. The standard way for real attributes 
to covary is the multivariate normal, which basically says 
that there is some other set of attributes one could de- 
fine, as linear combinations of the attributes given, which 
vary independently according to normal distributions. A 
simple way to let discrete attributes covary is to define
one super-attribute whose possible values are all possible
combinations of the values of the attributes given.

$^3$ Nothing in principle prevents a Bayesian analysis of more
complex model spaces that predict relational data.

If there are many attributes, the above ways to add 
correlation introduce a great many parameters in the 
models, making them very complex and, under the usual 
priors, much less preferable than simpler independent 
models. What we really want are simpler models which 
only allow partial covariance. About the simplest way 
to do this is to say that, for a given class, the attributes 
clump together in blocks of inter-related attributes. All 
the attributes in a block covary with each other, but not 
with the attributes in other blocks. Thus we can build 
a block model space from the covariant model spaces. 

Even this simpler form of covariance introduces more 
parameters than the independent case, and when each
class must have its own set of parameters, multiple 
classes are penalized more strongly. Attributes which 
are irrelevant to the whole classification, like a medi- 
cal patient’s favorite color, can be particularly costly. 
To reduce this cost, one can allow classes to share the 
specification of parameters associated with some of their 
independent blocks. Irrelevant attributes can then be 
shared by all classes at a minimum cost. 

Rather than allow arbitrary combinations of classes 
to share blocks, it is simpler to organize the classes as 
leaves of a tree. Each block can be placed at some node 
in this tree, to be shared by all the leaves below that 
node. In this way different attributes can be explained 
at different levels of an abstraction hierarchy. For med- 
ical patients the tree might have “viral infections” near 
the root, predicting fevers, and some more specific viral 
disease near the leaves, predicting more disease specific 
symptoms. Irrelevant attributes like favorite-color would 
go at the root. 


3.2 Notation Summary

For all the models to be considered in this paper, the
evidence $E$ will consist of a set of $I$ cases, an associated
set $\mathcal{K}$ of attributes, of size$^4$ $K$, and case attribute values
$X_{ik}$, which can include “unknown.” For example, medical
case number 8, described as (age = 23, blood-type =
A, . . .), would have $X_{8,1} = 23$, $X_{8,2} = A$, etc.

In Sections 4 and 5 we will describe applications
of Bayesian learning theory to various kinds of models
which could explain this evidence, beginning with
simple model spaces and building more complex spaces
from them. We begin with a single class. First, a single
attribute is considered, then multiple independent
attributes, then fully covariant attributes, and finally
selective covariance. In Section 5 we combine
these single classes into class mixtures. Table 1 gives
an overview of the various spaces.

For each space $S$ we will describe the continuous
parameters $V$, any discrete model parameters $T$,
normalized likelihoods $dL(E|VTS)$, and priors $d\pi(VT|S)$.
Most spaces have no discrete parameters $T$, and only one
region $R$, allowing us to usually ignore these parameters.
Approximations to the resulting marginals $M(ERT|S)$
and estimates $\mathcal{E}(V|ERTS)$ will be given, but not derived.
These will often be given in terms of general functions
$F$, so that they may be reused later on. As appropriate,
comments will be made about algorithms and
computational complexity. All of the likelihood functions
considered here assume the cases are independent,
i.e.,

$$L(E|VTS) = \prod_i L(E_i|VTS)$$

so we need only give $L(E_i|VTS)$ for each space, where
$E_i \equiv \{X_{i1}, X_{i2}, X_{i3}, \ldots, X_{iK}\}$.

$^4$ Note we use script letters like $\mathcal{K}$ for sets, and matching
ordinary letters $K$ to denote their size.

4 Single Class Models

4.1 Single Discrete Attribute - $S_{D1}$

A discrete attribute $k$ allows only a finite number of possible
values $l \in [1, 2, \ldots, L]$ for any $X_i$. “Unknown” is usually
treated here as just another possible value. A set of
independent coin tosses, for example, might have $L = 3$
with $l_1$ = heads, $l_2$ = tails, and $l_3$ = “unknown”. We
make the assumption $S_{D1}$ that there is only one discrete
attribute, and that the only parameters are the continuous
parameters $V = q_1 \ldots q_L$ consisting of the likelihoods
$L(X_i|V S_{D1}) = q_{(l = X_i)}$ for each possible value $l$. In the
coin example, $q_1 = .7$ would say that the coin was so
“unbalanced” that it has a 70 percent chance of coming
up heads each time.

There are only $L - 1$ free parameters since normalization
requires $\sum_l q_l = 1$. For this likelihood, all that
matters from the data are the number of cases with each
value,$^5$ $I_l = \sum_i \delta_{X_i l}$. In the coin example, $I_1$ would be
the number of heads. Such sums are called “sufficient
statistics” since they summarize all the information relevant
to a model.

We choose a prior

$$d\pi(V|S_{D1}) = dB(q_1 \ldots q_L|L) \equiv \frac{\Gamma(aL)}{\Gamma(a)^L} \prod_l q_l^{a-1}\, dq_l$$

which for $a > 0$ is a special case of a Dirichlet (multivariate
beta) distribution [Berger, 1985] ($\Gamma(y)$ is the Gamma function [Spiegel,
1968]). This formula is parameterized by $a$, a “hyperparameter”
which can be set to different values to specify
different priors. Here we set $a = 1/L$. This simple problem
has only one maximum, whose marginal is given by

$$M(E|S_{D1}) = F_1(I_1, \ldots, I_L, I, L) \equiv \frac{\Gamma(aL) \prod_l \Gamma(I_l + a)}{\Gamma(aL + I)\, \Gamma(a)^L}.$$

We have abstracted the function $F_1$, so we can refer to it
later. The prior above was chosen because it has a form
similar to the likelihood (and is therefore a “conjugate”
prior), and to make the following mean estimate of $q_l$
particularly simple

$$\mathcal{E}(q_l|E S_{D1}) = F_2(I_l, I, L) \equiv \frac{I_l + a}{I + aL} = \frac{I_l + \frac{1}{L}}{I + 1}$$

for $a = 1/L$. $F_2$ is also abstracted out for use later.
Note that while $F_2(I_l, I, L)$ is very similar to the classical
estimate of $I_l/I$, $F_2$ is defined even when $I = 0$. Using a
hash table, these results can be computed in order $I$
numerical steps, independent of $L$.

$^5$ Note that $\delta_{uv}$ denotes 1 when $u = v$ and 0 otherwise.
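
The sufficient statistics, marginal, and mean estimate for a single discrete attribute can be sketched in Python as below. The sketch assumes the symmetric prior with a = 1/L, a marginal of the form Γ(aL) ∏ Γ(I_l + a) / (Γ(aL + I) Γ(a)^L), and a mean estimate (I_l + a)/(I + aL); the function names `log_f1` and `f2` are illustrative stand-ins for F_1 and F_2, not AutoClass code.

```python
import math
from collections import Counter


def log_f1(counts, num_values):
    """Log marginal joint for one discrete attribute with a = 1/L.

    counts holds the statistics I_l for the values actually seen; unseen
    values contribute Gamma(a)/Gamma(a) = 1 and can be skipped.
    """
    a = 1.0 / num_values
    total = sum(counts)
    result = math.lgamma(a * num_values) - math.lgamma(a * num_values + total)
    for c in counts:
        result += math.lgamma(c + a) - math.lgamma(a)
    return result


def f2(count, total, num_values):
    """Mean estimate of q_l: (I_l + a) / (I + a L) with a = 1/L."""
    a = 1.0 / num_values
    return (count + a) / (total + a * num_values)


tosses = ["heads", "heads", "tails", "heads"]
stats = Counter(tosses)   # hash table of counts, built in order I steps
L = 3                     # heads, tails, "unknown"

print(math.exp(log_f1(stats.values(), L)), f2(stats["heads"], len(tosses), L))
```

Only values that actually occur are stored in the hash table, which is why the cost depends on the number of cases and not on the number of possible values.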
Space     Description         V                     T                        R    Subspaces                        Compute Time
$S_{D1}$  Single Discrete     $q_l$                                                                                $I$
$S_{R1}$  Single Real         $\mu \sigma$                                                                         $I$
$S_I$     Independent Attrs   $V_k$                                               $S_1 \equiv S_{D1}$ or $S_{R1}$  $IK$
$S_D$     Covariant Discrete  $q_{l_1 l_2 \ldots}$                                                                 $IK$
$S_R$     Covariant Real      $\mu_k \Sigma_{kk'}$                                                                 $(I + K)K^2$
$S_V$     Block Covariance    $V_b$                 $B \mathcal{K}_b$             $S_B \equiv S_D$ or $S_R$        $NK(IK_b + K_b^2)$
$S_M$     Flat Class Mixture  $\alpha_c V_c$        $C$                      $R$  $S_C \equiv S_I$ or $S_V$        $NKC(IK_b + K_b^2)$
$S_H$     Tree Class Mixture  $\alpha_c V_c$        $J_c \mathcal{K}_c T_c$  $R$  $S_C \equiv S_I$ or $S_V$        $NKC(IK_b + K_b^2)$

Table 1: Model Spaces


4.2 Single Real Attribute - $S_{R1}$

Real attribute values $X_i$ specify a small range of the real
line, with a center $x_i$ and a precision, $\Delta x_i$, assumed to be
much smaller than other scales of interest. For example,
someone’s weight might be measured as 70±1 kilograms.
For scalar attributes, which can only be positive, like
weight, it is best to use the logarithm of that variable
[Aitchison and Brown, 1957].

For $S_{R1}$, where there is only one real attribute, we
assume the standard normal distribution, where the sufficient
statistics are the data mean $\bar{x} = \frac{1}{I} \sum_i x_i$, the
geometric mean precision $\widehat{\Delta x} = (\prod_i \Delta x_i)^{1/I}$, and the
standard deviation $s$ given by $s^2 = \frac{1}{I} \sum_i (x_i - \bar{x})^2$. $V$ consists
of a model mean $\mu$ and deviation $\sigma$, and the likelihood
is given by the standard normal distribution.

$$dL(x_i|V S_{R1}) = dN(x_i, \mu, \sigma) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right) dx_i$$

For example, people’s weight might be distributed with
a mean of 80 kilograms and a deviation of 15. Since
all real data have a finite width, we replace $dx$ with
$\Delta x$ to approximate the likelihood $\Delta L(X_i|V S_{R1}) =
\int_{\Delta x_i} dL(x_i|V S_{R1}) \approx \frac{\Delta x_i}{dx_i}\, dL(x_i|V S_{R1})$.

As usual, we choose priors that treat the parameters
in $V$ independently.

$$d\pi(V|S_{R1}) = d\pi(\mu|S_{R1})\, d\pi(\sigma|S_{R1})$$

We choose a prior on the mean to be flat in the range of
the data,

$$d\pi(\mu|S_{R1}) = dR(\mu|\mu^+, \mu^-)$$

where $\mu^+ = \max_i x_i$ and $\mu^- = \min_i x_i$, by using the general
uniform distribution

$$dR(y|y^+, y^-) \equiv \frac{dy}{y^+ - y^-} \quad \text{for } y \in [y^-, y^+].$$

A flat prior is preferable because it is non-informative,
but note that in order to make it normalizable we must
cheat and use information from the data to cut it off at
some point. In the single attribute case, we can similarly
choose a flat prior in $\log(\sigma)$.

$$d\pi(\sigma|S_{R1}) = dR(\log(\sigma)|\log(\Delta\mu), \log(\min_i \Delta x_i))$$

where $\Delta\mu = \mu^+ - \mu^-$. The posterior again has just one
peak, so there is only one region $R$, and the resulting
marginal is

$$M(E|S_{R1}) = \frac{\sqrt{\pi}\;\Gamma\left(\frac{I-1}{2}\right)}{2\,(\pi I)^{I/2}\, \log(\Delta\mu / \min_i \Delta x_i)} \cdot \frac{\widehat{\Delta x}^{\,I}}{s^{I-1}\, \Delta\mu}.$$

Note that this joint is dimensionless. The estimates are
simply $\mathcal{E}(\mu|E S_{R1}) = \bar{x}$, and $\mathcal{E}(\sigma|E S_{R1}) = [\ldots]$.
Computation here takes order $I$ steps, used to compute the
sufficient statistics.
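
The sufficient statistics and the marginal for a single real attribute can be sketched in Python as follows. The sketch assumes that the marginal has the form √π Γ((I−1)/2) Δx̂^I / (2 (πI)^{I/2} log(Δμ / min Δx_i) s^{I−1} Δμ), which is what results from integrating the normal likelihood against flat priors on μ and log σ; the function names and the sample weights are illustrative only.

```python
import math


def real_sufficient_stats(centers, precisions):
    """Data mean, deviation s (1/I form), and geometric mean precision."""
    n = len(centers)
    mean = sum(centers) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in centers) / n)
    log_geo = sum(math.log(d) for d in precisions) / n
    return mean, s, math.exp(log_geo)


def log_marginal_real(centers, precisions):
    """Log marginal for one real attribute, under the assumed form above.

    Needs at least two distinct values and a data range wider than the
    smallest measurement precision.
    """
    n = len(centers)
    _, s, geo = real_sufficient_stats(centers, precisions)
    delta_mu = max(centers) - min(centers)
    log_range = math.log(math.log(delta_mu / min(precisions)))
    return (0.5 * math.log(math.pi) + math.lgamma((n - 1) / 2.0)
            - math.log(2.0) - 0.5 * n * math.log(math.pi * n) - log_range
            + n * math.log(geo) - (n - 1) * math.log(s) - math.log(delta_mu))


# Hypothetical weights in kilograms, each measured to within 1 kg.
weights_kg = [70.0, 82.0, 95.0, 64.0, 88.0]
precisions_kg = [1.0] * len(weights_kg)

print(real_sufficient_stats(weights_kg, precisions_kg))
print(log_marginal_real(weights_kg, precisions_kg))
```

Because the marginal is dimensionless, re-expressing the same measurements in grams rather than kilograms leaves the result unchanged.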


4.3 Independent Attributes - $S_I$

We now introduce some notation for collecting sets of
indexed terms like $X_{ik}$. A single such term inside a {}
will denote the set of all such indexed terms collected
across all of the indices, like $i$ and $k$ in $E = \{X_{ik}\} \equiv
\{X_{ik}$ such that $i \in [1, \ldots, I], k \in \mathcal{K}\}$. To collect across
only some of the indices we use $\bigcup_k$, as in $E_i \equiv \bigcup_k X_{ik} =
\{X_{i1}, X_{i2}, \ldots\}$, all the evidence for a single case $i$.

The simplest way to deal with cases having multiple
attributes is to assume $S_I$ that they are all independent,
i.e., treating each attribute as if it were a separate problem.
In this case, the parameter set $V$ partitions into
parameter sets $V_k = \bigcup_{l_k} q_{l_k}$ or $[\mu_k, \sigma_k]$, depending on
whether that $k$ is discrete or real. The likelihood, prior,
and joint for multiple attributes are all simple products
of the results above for one attribute, $S_1 \equiv S_{D1}$ or $S_{R1}$,
i.e.,

$$L(E_i|V S_I) = \prod_k L(X_{ik}|V_k S_1),$$

$$d\pi(V|S_I) = \prod_k d\pi(V_k|S_1),$$

and

$$M(E|S_I) = \prod_k M(E(k)|S_1)$$

where $E(k) \equiv \bigcup_i X_{ik}$ is all the evidence associated with
attribute $k$. The estimates $\mathcal{E}(V_k|E S_I) = \mathcal{E}(V_k|E(k) S_1)$
are exactly the same. Computation takes order $IK$ steps
here.


4.4 Fully Covariant Discretes - $S_D$

A model space $S_D$ which allows a set $\mathcal{K}$ of discrete
attributes to fully covary (i.e., contribute to a likelihood in
non-trivial combinations) can be obtained by treating all
combinations of base attribute values as particular values
of one super attribute, which then has $L' = \prod_k L_k$
values, where $L_k$ is the number of values of attribute $k$,
so $L'$ can be a very large number! $V$ consists
of terms like $q_{l_1 l_2 \ldots l_K}$, indexed by all the attributes. $I_l$
generalizes to

$$I_{l_1 l_2 \ldots l_K} = \sum_i \prod_k \delta_{X_{ik} l_k}.$$

Given this transformation, the likelihoods, etc. look the
same as before:

$$L(E_i|V S_D) = q_{l_1 l_2 \ldots l_K}, \quad \text{where each } l_k = X_{ik},$$

$$d\pi(V|S_D) = dB(\{q_{l_1 l_2 \ldots l_K}\}|L'),$$

$$M(E|S_D) = F_1(\{I_{l_1 l_2 \ldots l_K}\}, I, L'),$$

and$^6$

$$\mathcal{E}(q_{l_1 l_2 \ldots l_K}|E S_D) = F_2(I_{l_1 l_2 \ldots l_K}, I, L').$$

Computation takes order $IK$ steps here. This model
could, for example, use a single combined hair-color
eye-color attribute to allow a correlation between people
being blond and blue-eyed.

$^6$ $F_1$ and $F_2$ are defined in Section 4.1.

4.5 Fully Covariant Reals - $S_R$

If we assume $S_R$ that a set $\mathcal{K}$ of real-valued attributes
follow the multivariate normal distribution, we replace
the $\sigma^2$ above with a model covariance matrix $\Sigma_{kk'}$ and
$s^2$ with a data covariance matrix

$$S_{kk'} = \frac{1}{I} \sum_i (x_{ik} - \bar{x}_k)(x_{ik'} - \bar{x}_{k'}).$$

The $\Sigma_{kk'}$ must be symmetric, with $\Sigma_{kk'} = \Sigma_{k'k}$,
and “positive definite”, satisfying $\sum_{kk'} y_k \Sigma_{kk'} y_{k'} > 0$
for any nonzero vector $y_k$. The likelihood for a set of attributes
$\mathcal{K}$ is$^7$

$$dL(E_i|V S_R) = dN(E_i, \{\mu_k\}, \{\Sigma_{kk'}\}, K),$$

where

$$dN(E_i, \{\mu_k\}, \{\Sigma_{kk'}\}, K) \equiv \frac{\exp\left(-\frac{1}{2} \sum_{kk'} (x_{ik} - \mu_k)\, \Sigma^{inv}_{kk'}\, (x_{ik'} - \mu_{k'})\right)}{(2\pi)^{K/2}\, |\Sigma_{kk'}|^{1/2}} \prod_k dx_{ik}$$

is the multivariate normal in $K$ dimensions.

As before, we choose a prior that takes the means to
be independent of each other, and independent of the
covariance

$$d\pi(V|S_R) = d\pi(\{\Sigma_{kk'}\}|S_R) \prod_k d\pi(\mu_k|S_{R1}),$$

so the estimates of the means remain the same,
$\mathcal{E}(\mu_k|E S_R) = \bar{x}_k$. We choose the prior on $\Sigma_{kk'}$ to use
an inverse Wishart distribution [Mardia et al., 1979]

$$d\pi(\{\Sigma_{kk'}\}|S_R) = dW^{inv}_K(\{\Sigma_{kk'}\}|\{G_{kk'}\}, h) \equiv \frac{|G_{kk'}|^{h/2}\, |\Sigma_{kk'}|^{-\frac{h+K+1}{2}} \exp\left(-\frac{1}{2} \sum_{kk'} \Sigma^{inv}_{kk'} G_{k'k}\right)}{2^{Kh/2}\, \pi^{K(K-1)/4} \prod_{a=1}^{K} \Gamma\left(\frac{h+1-a}{2}\right)} \prod_{k \le k'} d\Sigma_{kk'}$$

which is normalized (integrates to 1) for $h \ge K$ and
$\Sigma_{kk'}$ symmetric positive definite. This is a “conjugate”
prior, meaning that it makes the resulting posterior
$d\pi(\{\Sigma_{kk'}\}|E S_R)$ take the same mathematical form
as the prior. This choice makes the resulting integrals
manageable, but requires us to choose an $h$ and all the
components of $G_{kk'}$. We choose $h = K$ to make the
prior as broad as possible, and for $G_{kk'}$ we “cheat” and
choose $G_{kk'} = S_{kk'} \delta_{kk'}$ in order to avoid overly distorting
the resulting marginal

$$M(E|S_R) = \prod_{a=1}^{K} \frac{\Gamma\left(\frac{I+h-a}{2}\right)}{\Gamma\left(\frac{h+1-a}{2}\right)} \cdot \frac{|G_{kk'}|^{h/2}}{\pi^{K(I-1)/2}\, I^{K/2}\, |I S_{kk'} + G_{kk'}|^{\frac{I+h-1}{2}}} \prod_k \frac{\widehat{\Delta x}_k^{\,I}}{\Delta\mu_k}$$

and estimates

$$\mathcal{E}(\Sigma_{kk'}|E S_R) = \frac{I S_{kk'} + G_{kk'}}{I + h - K - 2} = \frac{I + \delta_{kk'}}{I - 2}\, S_{kk'}.$$

If we choose $G_{kk'}$ too large it dominates the estimates,
and if $G_{kk'}$ is too small the marginal is too small.
The compromise above should only overestimate the
marginal somewhat, since it in effect pretends to have
seen previous data which agrees with the data given.
Note that the estimates are undefined unless $I > 2$.
Computation here takes order $(I + K)K^2$ steps. At
present, we lack a satisfactory way to approximate the
above marginal when some values are unknown.

$^7$ $\Sigma^{inv}_{ab}$ denotes the matrix inverse of $\Sigma_{ab}$, satisfying
$\sum_b \Sigma^{inv}_{ab} \Sigma_{bc} = \delta_{ac}$, and $|\Sigma_{ab}|$ denotes the
determinant of the matrix $\{\Sigma_{ab}\}$.
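
The covariance estimate for fully covariant reals can be sketched in plain Python as below. The sketch assumes the choices h = K and G_kk' = S_kk' δ_kk' (the diagonal of the data covariance), so that the estimate (I S + G)/(I + h − K − 2) reduces to (I + δ_kk') S_kk' / (I − 2); the function names and the sample data are illustrative.

```python
def data_covariance(cases):
    """Means and data covariance S_kk' = (1/I) sum_i (x_ik - m_k)(x_ik' - m_k')."""
    n, k = len(cases), len(cases[0])
    means = [sum(row[j] for row in cases) / n for j in range(k)]
    cov = [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in cases) / n
            for b in range(k)] for a in range(k)]
    return means, cov


def covariance_estimate(cases):
    """Mean estimate (I S + G) / (I - 2), with G the diagonal part of S."""
    n, k = len(cases), len(cases[0])
    if n <= 2:
        raise ValueError("the estimate is undefined unless I > 2")
    _, s = data_covariance(cases)
    return [[(n * s[a][b] + (s[a][b] if a == b else 0.0)) / (n - 2)
             for b in range(k)] for a in range(k)]


# Hypothetical cases with two perfectly correlated real attributes.
cases = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, S = data_covariance(cases)
print(means, S, covariance_estimate(cases))
```

The diagonal prior term inflates the variances slightly more than the covariances, which pulls the estimated correlation a little toward independence for small data sets.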

4.6 Block Covariance - $S_V$

Rather than just having either full independence or full
dependence of attributes, we prefer a model space $S_V$
where some combinations of attributes may covary while
others remain independent. This allows us to avoid paying
the cost of specifying covariance parameters when
they cannot buy us a significantly better fit to the data.

We partition the attributes $\mathcal{K}$ into $B$ blocks $\mathcal{K}_b$, with
full covariance within each block and full independence
between blocks. Since we presently lack a model allowing
different types of attributes to covary, all the attributes
in a block must be of the same type. Thus reals and
discretes may not mutually covary.

We are aware of other models of partial dependence,
such as the trees of Chow and Liu described in [Pearl,
1988], but choose this approach because it includes the
limiting cases of full dependence and full independence.

The evidence $E$ partitions block-wise into $E(\mathcal{K}_b)$ (using
$E_i(\mathcal{K}) \equiv \bigcup_{k \in \mathcal{K}} X_{ik}$ and $E(\mathcal{K}) \equiv \{E_i(\mathcal{K})\}$), each with
its own sufficient statistics; and the parameters $V$ partition
into parameters $V_b = \{q_{l_1 l_2 \ldots l_K}\}$ or $[\{\Sigma_{kk'}\}, \{\mu_k\}]$.
Each block is treated as a different problem, except that
we now also have discrete parameters $T$ to specify which
attributes covary, by specifying $B$ blocks and $\{\mathcal{K}_b\}$
attributes in each block. Thus the likelihood

$$L(E_i|VT S_V) = \prod_b^B L(E_i(\mathcal{K}_b)|V_b S_B)$$

is a simple product of block terms $S_B \equiv S_D$ or $S_R$ assuming
full covariance within each block, and the estimates
$\mathcal{E}(V_b|ET S_V) = \mathcal{E}(V_b|E(\mathcal{K}_b) S_B)$ are the same as before.

We choose a prior which predicts the block structure
$B\{\mathcal{K}_b\}$ independently of the parameters $V_b$ within each
independent block

$$d\pi(VT|S_V) = \pi(B\{\mathcal{K}_b\}|S_V) \prod_b d\pi(V_b|S_B)$$

which results in a similarly decomposed marginal

$$M(ET|S_V) = \pi(B\{\mathcal{K}_b\}|S_V) \prod_b M(E(\mathcal{K}_b)|S_B).$$

We choose a block structure prior

$$\pi(B\{\mathcal{K}_b\} | S_V) = \frac{1}{K_R Z(K_R, B_R)\, K_D Z(K_D, B_D)},$$

where $K_R$ is the set of real attributes and $B_R$ is the
number of real blocks (and similarly for $K_D$ and $B_D$).
This says that it is equally likely that there will be one
or two or three, etc. blocks, and, given the number of
blocks, each possible way to group attributes is equally
likely. This is normalized using $Z(A, U)$, given by

$$Z(A, U) = \sum_{u=1}^{U} \frac{(-1)^{U-u}\, u^A}{(U-u)!\, u!},$$


which gives the number of ways one can partition a set 
with A elements into U subsets. This prior prefers the 
special cases of full covariance and full independence, 
since there are fewer ways to make these block combi- 
nations. For example, in comparing the hypothesis that 
each attribute is in a separate block (i.e., all indepen- 
dent) with the hypothesis that only one particular pair 
of attributes covary together in a block of size two, this 
prior will penalize the covariance hypothesis in propor- 
tion to the number of such pairs possible. Thus this 
prior includes a “significance test”, so that a covariance 
hypothesis will only be chosen if the added fit to the 
data from the extra covariance is enough to overcome 
this penalty. 
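
As an illustration of the counting behind this prior, the following Python sketch evaluates Z(A, U) with the standard closed form for the number of ways to partition a set into non-empty subsets, and compares the two hypotheses discussed above for a single attribute type. The function names and the choice of five attributes are assumptions made for this example only; they are not taken from AutoClass.

```python
from fractions import Fraction
from math import comb, factorial


def num_partitions(a: int, u: int) -> int:
    """Z(A, U): number of ways to split a set of `a` elements into `u` non-empty subsets."""
    total = sum((-1) ** (u - j) * comb(u, j) * j ** a for j in range(1, u + 1))
    return total // factorial(u)  # the alternating sum is always divisible by u!


def block_prior_one_type(num_attrs: int, num_blocks: int) -> Fraction:
    """Prior of one specific grouping of same-type attributes into `num_blocks` blocks.

    Covers a single attribute type; with real and discrete attributes the
    prior in the text is the product of two such factors.
    """
    return Fraction(1, num_attrs * num_partitions(num_attrs, num_blocks))


K = 5  # hypothetical number of attributes of one type
all_independent = block_prior_one_type(K, K)  # every attribute in its own block
one_pair = block_prior_one_type(K, K - 1)     # one particular pair shares a block
penalty = all_independent / one_pair          # equals the number of possible pairs
print(all_independent, one_pair, penalty)
```

With five attributes there are ten possible pairs, so a hypothesis naming one specific covarying pair starts with a prior ten times smaller than the fully independent hypothesis.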

Computation here takes order $N K (I \bar{K}_b + \bar{K}_b^2)$ steps,
where $N$ is the number of search trials done before quitting,
which would be around $(K - 1)!$ for a complete
search of the space. $\bar{K}_b$ is an average, over both the
search trials and the attributes, of the block size of real
attributes (and unity for discrete attributes).


5 Class Mixtures 

5.1 Flat Mixtures - Sm 

The above model spaces $S_C = S_V$ or $S_I$ can be thought
of as describing a single class, and so can be extended
by considering a space $S_M$ of simple mixtures of such
classes [Titterington et al., 1985]. Figure 1 shows
how this model, with $S_C = S_I$, can fit a set of artificial
real-valued data in five dimensions.

In this model space the likelihood

$$L(E_i | V T S_M) = \sum_c^C \alpha_c L(E_i | V_c T_c S_C)$$

sums over products of “class weights” $\alpha_c$, that give the
probability that any case would belong to class $c$ of the
$C$ classes, and class likelihoods describing how members
of each class are distributed. In the limit of large $C$ this
model space is general enough to be able to fit any distribution
arbitrarily closely, and hence is “asymptotically
correct”.

The parameters $T = [C, \{T_c\}]$ and $V = [\{\alpha_c\}, \{V_c\}]$
combine parameters for each class and parameters describing
the mixture. The prior is similarly broken down
as

$$d\pi(VT | S_M) = F_3(C)\, C!\, dB(\{\alpha_c\} | C) \prod_c d\pi(V_c T_c | S_C)$$



where $F_3(C) = 6 / (\pi^2 C^2)$ for $C > 0$ and is just one arbitrary
choice of a broad prior over integers. The $\alpha_c$ are treated
as if the choice of class were another discrete attribute,
except that a $C!$ is added because classes are not distinguishable
a priori.

Figure 1: AutoClass III Finds Three Classes

We plot attributes 1 vs. 2, and 3 vs. 4 for an artificial data
set. One $\sigma$ deviation ovals are drawn around the centers of
the three classes.

Except in very simple problems, the resulting joint
$dJ(EVT | S)$ has many local maxima, and so we must
now focus on regions $R$ of the $V$ space. To find such
local maxima we use the “EM” algorithm [Dempster et
al., 1977], which is based on the fact that at a maximum
the class parameters $V_c$ can be estimated from weighted
sufficient statistics. Relative likelihood weights

$$w_{ic} = \frac{\alpha_c L(E_i | V_c T_c S_C)}{L(E_i | V T S_M)}$$

give the probability that a particular case $i$ is a member
of class $c$. These weights satisfy $\sum_c w_{ic} = 1$, since every
case must really belong to one of the classes. Using these
weights we can break each case into “fractional cases”,
assign these to their respective classes, and create new
“class data” $E_c = \bigcup_{ik} [X_{ik}, w_{ic}]$ with new weighted-class
sufficient statistics obtained by using weighted sums
$\sum_i w_{ic}$ instead of sums $\sum_i$. For example $I_c = \sum_i w_{ic}$,
$\bar{x}_{kc} = \frac{1}{I_c} \sum_i w_{ic} x_{ik}$, $I_{l_1 \ldots l_K c} = \sum_i w_{ic} \prod_k \delta_{x_{ik} l_k}$, and
$\sigma^2_{kc} = \frac{1}{I_c} \sum_i w_{ic} (x_{ik} - \bar{x}_{kc})^2$. Substituting these statistics into
any previous class likelihood function $L(E | V_c T_c S_C)$ gives
a weighted likelihood $L'(E_c | V_c T_c S_C)$ and associated new
estimates and marginals.

At the maxima, the weights $w_{ic}$ should be consistent
with estimates of $V = \{[\alpha_c, V_c]\}$ from $\mathcal{E}(V_c | ERS_M) =
\mathcal{E}'(V_c | E_c S_C)$ and $\mathcal{E}(\alpha_c | ERS_M) = F_2(I_c, I, C)$. To reach
a maximum we start out at a random seed and repeatedly
use our current best estimates of $V$ to compute the $w_{ic}$,
and then use the $w_{ic}$ to re-estimate the $V$, stopping when
they both predict each other. Typically this takes 10 to
100 iterations. This procedure will converge from any
starting point, but converges more slowly near the peak
than second-order methods.
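
The alternation between weights and re-estimated parameters can be sketched in Python for the simplest case of one real attribute with a normal likelihood per class. This is a minimal sketch under stated assumptions, not the AutoClass procedure: it re-estimates parameters by plain weighted maximum likelihood instead of the Bayesian estimates used in the text, it starts from fixed values instead of a random seed so that the run is repeatable, and all function and variable names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_mixture(xs, means, variances, alphas, iterations=100, tol=1e-9):
    """EM for a one-attribute normal mixture using weighted sufficient statistics."""
    weights = []
    for _ in range(iterations):
        # Weights: w_ic = alpha_c * L(x_i | class c) / sum over classes.
        weights = []
        for x in xs:
            joint = [a * normal_pdf(x, m, v)
                     for a, m, v in zip(alphas, means, variances)]
            total = sum(joint)
            weights.append([j / total for j in joint])
        # Re-estimate each class from its "fractional cases".
        new_means, new_vars, new_alphas = [], [], []
        for c in range(len(means)):
            i_c = sum(w[c] for w in weights)  # weighted class size I_c
            mean_c = sum(w[c] * x for w, x in zip(weights, xs)) / i_c
            var_c = sum(w[c] * (x - mean_c) ** 2 for w, x in zip(weights, xs)) / i_c
            new_means.append(mean_c)
            new_vars.append(max(var_c, 1e-6))  # assumed floor to avoid collapse
            new_alphas.append(i_c / len(xs))
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means, variances, alphas = new_means, new_vars, new_alphas
        if shift < tol:  # weights and parameters now predict each other
            break
    return means, variances, alphas, weights


rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(6.0, 1.0) for _ in range(200)])
means, variances, alphas, weights = em_mixture(
    data, means=[1.0, 4.0], variances=[1.0, 1.0], alphas=[0.5, 0.5])
print(means, alphas)
```

The returned weights are those of the last weighting step, so each row sums to one across classes.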

Integrating the joint in $R$ can’t be done directly because
the product of a sum in the full likelihood is hard
to decompose, but if we use fractional cases to approximate
the likelihood

$$L(E_i | V T R S_M) = \sum_c^C \alpha_c L(E_i | V_c T_c S_C) \cong \prod_c^C \left(\alpha_c L(E_i | V_c T_c S_C)\right)^{w_{ic}}$$

holding the $w_{ic}$ fixed, we get an approximate joint:

$$M(ERT | S_M) \cong F_3(C)\, C!\, F_1(\{I_c\}, I, C) \prod_c M'(E_c T_c | S_C)$$

Our standard search procedure combines an explicit
search in $C$ with a random search in all the other parameters.
Each trial begins converging from classes built
around $C$ random case pairs. The $C$ is chosen randomly
from a log-normal distribution fit to the $C$s of the 6 to 10
best trials seen so far, after trying a fixed range of $C$s to
start. We also have developed alternative search procedures
which selectively merge and split classes according
to various heuristics. While these usually do better, they
sometimes do much worse.

The marginal joints of the different trials generally 
follow a log-normal distribution, allowing us to estimate 
during the search how much longer it will take on average 
to find a better peak, and how much better it is likely 
to be. 
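
One way to picture the choice of C for a new trial is the following Python sketch, which fits a log-normal distribution to the class counts of the best trials so far and draws a proposal from it. The function name, the rounding to an integer, and the floor on the spread are assumptions made here for illustration; the text does not specify these details.

```python
import math
import random
import statistics


def propose_num_classes(best_cs, rng, min_sigma=0.1):
    """Draw a class count C from a log-normal fit to the C values of the best trials."""
    logs = [math.log(c) for c in best_cs]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs) if len(logs) > 1 else 0.0
    sigma = max(sigma, min_sigma)  # assumed floor so the search keeps exploring
    return max(1, round(rng.lognormvariate(mu, sigma)))


rng = random.Random(1)
best_trial_cs = [4, 5, 5, 6, 7, 5, 6]  # hypothetical C values of the best trials
proposals = [propose_num_classes(best_trial_cs, rng) for _ in range(1000)]
print(min(proposals), statistics.median(proposals), max(proposals))
```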

In the simpler model space $S_{MI}$ where $S_C = S_I$ the
computation is order $N I \bar{C} K$, where $\bar{C}$ averages over the
search trials. $N$ is the number of possible peaks, out
of the immense number usually present, that a computation
actually examines. In the covariant space $S_{MV}$
where $S_C = S_V$ this becomes $N K \bar{C} (I \bar{K}_b + \bar{K}_b^2)$.

5.2 Class Hierarchy and Inheritance - Sh 

The above class mixture model space $S_M$ can be generalized
to a hierarchical space $S_H$ by replacing the above
set of classes with a tree of classes. Leaves of the tree,
corresponding to the previous classes, can now inherit
specifications of class parameters from “higher” (closer
to the root) classes. For the purposes of the parameters
specified at a class, all of the classes below that class
pool their weight into one big class. Parameters associated
with “irrelevant” attributes are specified independently
at the root. Figure 2 shows how a class tree, this
time with $S_C = S_V$, can better fit the same data as in
Figure 1. See [Hanson et al., 1991] for more about this
comparison.

The tree of classes has one root class $r$. Every other
class $c$ has one parent class $P_c$, and every class has $J_c$
child classes given by $C_{cj}$, where the index $j$ ranges over
the children of a class. Each child class has a weight
$\alpha_{cj}$ relative to its siblings, with $\sum_j^{J_c} \alpha_{cj} = 1$, and an
absolute weight $\alpha_{C_{cj}} = \alpha_{cj} \alpha_c$, with $\alpha_r = 1$.
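
The relation between sibling-relative and absolute weights can be made concrete with a small Python sketch. The `ClassNode` type, its field names, and the example tree are hypothetical and exist only to illustrate the recursion; they are not part of AutoClass.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClassNode:
    name: str
    relative_weight: float = 1.0  # weight relative to siblings; 1 for the root
    children: List["ClassNode"] = field(default_factory=list)


def absolute_weights(root: ClassNode) -> Dict[str, float]:
    """Absolute weight of each class: relative weight times the parent's absolute weight."""
    result = {root.name: 1.0}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            total = sum(child.relative_weight for child in node.children)
            if abs(total - 1.0) > 1e-9:
                raise ValueError(f"children of {node.name} must have weights summing to 1")
        for child in node.children:
            result[child.name] = child.relative_weight * result[node.name]
            stack.append(child)
    return result


tree = ClassNode("r", children=[
    ClassNode("a", 0.6, children=[ClassNode("a1", 0.5), ClassNode("a2", 0.5)]),
    ClassNode("b", 0.4),
])
weights_by_class = absolute_weights(tree)
print(weights_by_class)
```

Because sibling weights sum to one at every level, the absolute weights of the leaves (here a1, a2, and b) also sum to one, as a mixture over leaves requires.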

While other approaches to inheritance are possible,
here each class is given an associated set of attributes
$\mathcal{K}_c$, which it predicts independently through a likelihood
$L(E_i(\mathcal{K}_c) | V_c T_c S_C)$ and which no class above or below
it predicts. To avoid having redundant trees which
describe the same likelihood function, only $\mathcal{K}_r$ can be
empty, and non-leaves must have $J_c \geq 2$.

We need to ensure that all attributes are predicted
somewhere at or above each leaf class. So we call $A_c$
the set of attributes which are predicted at or below
each class, start with $A_r = K$, and then recursively partition
each $A_c$ into attributes $\mathcal{K}_c$ “kept” at that class,
and hence predicted directly by it, and the remaining
attributes to be predicted at or below each child $A_{C_{cj}}$.
For leaves $A_c = \mathcal{K}_c$.

Figure 2: AutoClass IV Finds Class Tree $\times 10^{120}$ Better

Lists of attribute numbers denote covariant blocks within
each class, and the ovals now indicate the leaf classes.

Expressed in terms of the leaves the likelihood is again
a mixture:

$$L(E_i | V T S_H) = \sum_{c : J_c = 0} \alpha_c \prod_{c' = c, P_c, P_{P_c}, \ldots, r} L(E_i(\mathcal{K}_{c'}) | V_{c'} T_{c'} S_C)$$

allowing the same EM procedure as before to find local
maxima. The case weights here $w_{ci} = \sum_j^{J_c} w_{C_{cj} i}$ (with
$w_{ri} = 1$) sum like in the flat mixture case and define
class statistics $E_c(\mathcal{K}_c) = \bigcup_{k \in \mathcal{K}_c, i} [X_{ik}, w_{ci}]$.

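We also choose a similar prior, though it must now
specify the $\mathcal{K}_c$ as well:

$$d\pi(VT | S_H) = \prod_c d\pi(J_c \mathcal{K}_c | A_c S_H)\, J_c!\, dB(\{\alpha_{cj}\} | J_c)\, d\pi(V_c T_c | \mathcal{K}_c S_C)$$

$$d\pi(J_c \mathcal{K}_c | A_c S_H) = F_3(J_c - 1)\, \frac{K_c!\, (A_c - K_c)!}{(A_c + \delta_{cr})\, A_c!}$$

for all subsets $\mathcal{K}_c$ of $A_c$ of size in the range $[1 - \delta_{cr}, A_c]$
(with $K_c$ and $A_c$ in the fraction denoting the sizes of these sets),
except that $F_3(J_c - 1)$ is replaced by $\delta_{0 J_c}$ when $A_c = \mathcal{K}_c$.
Note that this prior is recursive, as the prior for each
class depends on what attributes have been chosen
for its parent class.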

This prior says that each possible number of attributes
kept is equally likely, and given the number to be kept
each particular combination is equally likely. This prior
prefers the simpler cases of $K_c = A_c$ and $K_c = 1$ and so
again offers a significance test. In comparing the hypothesis
that all attributes are kept at a class with a hypothesis
that all but one particular attribute will be kept at that
class, this prior penalizes the all-but-one hypothesis in
proportion to the number of attributes that could have
been kept instead.
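
The size of this penalty can be checked with a short Python sketch of the attribute-selection part of the prior, built directly from the verbal description: a uniform choice over the number of attributes kept, then a uniform choice among subsets of that size. The function name and arguments are hypothetical, and the sketch deliberately leaves out the factor that governs the number of child classes.

```python
from fractions import Fraction
from math import comb


def kept_attribute_prior(num_kept: int, num_available: int, is_root: bool) -> Fraction:
    """Prior of keeping one particular subset of `num_kept` attributes at a class.

    Only the root may keep zero attributes; every other class keeps at least one.
    """
    min_kept = 0 if is_root else 1
    if not (min_kept <= num_kept <= num_available):
        return Fraction(0)
    num_sizes = num_available - min_kept + 1  # equally likely subset sizes
    return Fraction(1, num_sizes * comb(num_available, num_kept))


available = 5  # hypothetical number of attributes available at a non-root class
keep_all = kept_attribute_prior(available, available, is_root=False)
keep_all_but_one = kept_attribute_prior(available - 1, available, is_root=False)
ratio = keep_all / keep_all_but_one
print(keep_all, keep_all_but_one, ratio)
```

With five attributes available, a hypothesis that passes one specific attribute down to the children is five times less probable a priori than keeping all five, matching the number of attributes that could have been singled out.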

The marginal joint becomes

$$M(ERT | S_H) \cong \prod_c d\pi(J_c \mathcal{K}_c | A_c S_H)\, J_c!\, F_1(\{I_{cj}\}, I_c, J_c)\, M'(E_c(\mathcal{K}_c) T_c | S_C)$$

and estimates are again $\mathcal{E}(V_c | ERS_H) = \mathcal{E}'(V_c | E_c(\mathcal{K}_c) S_C)$
and $\mathcal{E}(\alpha_{cj} | ERS_H) = F_2(I_{cj}, I_c, J_c)$.

In the general case of $S_{HV}$, where $S_C = S_V$, computation
again takes $N K \bar{C} (I \bar{K}_b + \bar{K}_b^2)$, except that the $\bar{C}$ is
now also an average of, for each $k$, the number of classes
in the hierarchy which use that $k$ (i.e., have $k \in \mathcal{K}_c$).
Since this is usually less than the number of leaves, the
model $S_H$ is typically cheaper to compute than $S_M$ for
the same number of leaves.

Searching in this most complex space $S_{HV}$ is challenging.
There are a great many search dimensions where one
can trade off simplicity and fit to the data, and we have 
only begun to explore possible heuristics. Blocks can be 
merged or split, classes can be merged or split, blocks 
can be promoted or demoted in the class tree, EM itera- 
tions can be continued farther, and one can try a random 
restart to seek a new peak. But even the simplest ap- 
proaches to searching a more general model space seem 
to do better than smarter searches of simpler spaces. 

6 Conclusion 

The Bayesian approach to unsupervised classification de- 
scribes each class by a likelihood function with some free 
parameters, and then adds in a few more parameters to 
describe how those classes are combined. Prior expectations
on those parameters $VT$ combine with the evidence
$E$ to produce a marginal joint $M(ERT | S)$ which is used
as an evaluation function for classifications in a region
$R$ near some local maximum of the continuous parameters
$V$ and with some choice of discrete model parameters $T$.
This evaluation function optimally trades off the com- 
plexity of the model with its fit to the data, and is used 
to guide an open-ended search for the best classification. 

In this paper we have applied this theory to model 
spaces of varying complexity in unsupervised classification.
For each space we provide a likelihood, prior,
marginal joint, and estimates. This should provide 
enough information to allow anyone to reproduce Au- 
toClass, or to use the same evaluation functions in other 
contexts where these models might be relevant. 